IE 582 Homework 1 - FALL'24
Berat Kubilay Güngez - 2021402087
1. Introduction
As high-frequency communication technologies like 5G evolve, designing efficient antennas has become crucial. Antenna performance, often evaluated by the S11 parameter, requires computationally intensive electromagnetic (EM) simulations, making traditional trial-and-error methods impractical. To address this, machine learning provides a data-driven approach to model and predict antenna characteristics based on design parameters. This assignment uses techniques like Principal Component Analysis (PCA) for dimensionality reduction and linear regression for predictive modeling to better understand and simplify the complex relationships within antenna design, aiming to improve efficiency in creating high-performance systems.
2. Related Literature
In electrical network analysis, S-parameters (scattering parameters) describe how an electromagnetic signal interacts at different network ports, typically in high-frequency circuits. The S11 parameter, a specific type of S-parameter, measures the reflection at the input port—indicating how much signal is reflected back rather than transmitted. This reflection coefficient helps engineers assess antenna efficiency and impedance matching, critical for minimizing signal loss in RF designs. S-parameters like S11 are widely used for their practicality in evaluating performance across frequencies without needing complex internal device details.
For more, see Ansys on S-parameters.
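S11 is often quoted in decibels rather than as a linear magnitude. The following is a minimal sketch of that conversion, assuming the common convention that the dB value is 20·log10 of the magnitude and that the reflected power fraction is the squared magnitude. The function names and sample values are illustrative and are not part of the assignment code.

```python
import numpy as np

def s11_to_db(real, imag):
    """Return |S11| in dB (20*log10 of the magnitude) from real and imaginary parts."""
    magnitude = np.hypot(real, imag)
    return 20.0 * np.log10(magnitude)

def reflected_power_fraction(real, imag):
    """Fraction of the incident power reflected at the port, |S11|**2."""
    return np.hypot(real, imag) ** 2

# Hypothetical sample values, chosen only for illustration.
real = np.array([0.6, 0.1, 0.0])
imag = np.array([0.8, 0.0, 0.5])
s11_db = s11_to_db(real, imag)
power = reflected_power_fraction(real, imag)
```

A magnitude of 1 corresponds to 0 dB (everything is reflected), while a magnitude of 0.1 corresponds to -20 dB (1% of the power is reflected).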
3. Data Preprocessing and Analysis
3a. Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
3b. Inspecting Output Data
To understand how the S11 parameter behaves across frequencies, the real and imaginary components of the data are imported and the magnitude is computed from them. The magnitude of S11 summarizes the antenna's performance at each frequency: lower values indicate less reflected signal.
real_output_data_loc = "data/hw1_real.csv"
real_output_df = pd.read_csv(real_output_data_loc)
img_output_data_loc = "data/hw1_img.csv"
img_output_df = pd.read_csv(img_output_data_loc)
output_df = np.sqrt(real_output_df **2 + img_output_df **2) # Magnitude of the S11 parameter
output_df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.781778 | 0.783440 | 0.785795 | 0.788790 | 0.792359 | 0.796433 | 0.800936 | 0.805793 | 0.810930 | 0.816277 | ... | 0.988520 | 0.988551 | 0.988581 | 0.988610 | 0.988638 | 0.988664 | 0.988690 | 0.988714 | 0.988737 | 0.988759 |
| 1 | 0.986860 | 0.986669 | 0.986470 | 0.986263 | 0.986048 | 0.985824 | 0.985592 | 0.985350 | 0.985100 | 0.984839 | ... | 0.945061 | 0.945880 | 0.946669 | 0.947428 | 0.948158 | 0.948861 | 0.949537 | 0.950188 | 0.950814 | 0.951417 |
| 2 | 0.866883 | 0.865643 | 0.864258 | 0.862724 | 0.861039 | 0.859198 | 0.857199 | 0.855036 | 0.852705 | 0.850202 | ... | 0.865665 | 0.861328 | 0.856551 | 0.851268 | 0.845404 | 0.838870 | 0.831564 | 0.823368 | 0.814148 | 0.803752 |
| 3 | 0.995069 | 0.995055 | 0.995041 | 0.995025 | 0.995009 | 0.994991 | 0.994973 | 0.994953 | 0.994933 | 0.994912 | ... | 0.935682 | 0.932755 | 0.929637 | 0.926312 | 0.922765 | 0.918978 | 0.914932 | 0.910607 | 0.905982 | 0.901033 |
| 4 | 0.985009 | 0.985235 | 0.985447 | 0.985645 | 0.985831 | 0.986005 | 0.986169 | 0.986321 | 0.986464 | 0.986597 | ... | 0.988472 | 0.988418 | 0.988361 | 0.988304 | 0.988244 | 0.988183 | 0.988121 | 0.988057 | 0.987991 | 0.987924 |
5 rows × 201 columns
To begin, a single output is selected to compare the real part, the imaginary part, and the magnitude with one another. The third sample (index 2) was chosen because its magnitude shows a clear dip.
plt.figure(figsize=(12, 6))
plt.plot(output_df.iloc[2], label="Magnitude")
plt.plot(real_output_df.iloc[2], label="Real part")
plt.plot(img_output_df.iloc[2], label="Imaginary part")
plt.title("S11 Parameter of the 3rd Sample")
plt.legend(loc="best", fontsize=10)
plt.xticks(range(0, 200, 25))
plt.xlabel("Frequency")
plt.show()
Then, the first 15 samples are plotted to compare their characteristics.
plt.figure(figsize=(12, 6))
for i in range(0,15):
    plt.plot(output_df.iloc[i], label=f"output {i}")
plt.legend(loc="best", fontsize=6)
plt.title("Magnitude of the first 15 outputs")
plt.xticks(range(0, 200, 25))
plt.xlabel("Frequency")
plt.show()
As can be seen in the plot above, each sample behaves differently across the frequency values. Among these samples, output 2 appears to perform best around the 60th frequency value. Approximately six outputs reflect nearly all of the signal back, meaning that they perform poorly.
To apply predictive approaches later, each output curve needs to be reduced to a single value that summarizes its behavior across all frequencies. One option is to use the minimum value. To make this summary less dependent on a single frequency point, the average of the 25 smallest values is used instead.
min_values_output = output_df.apply(lambda x: x.nsmallest(25).mean(), axis=1)
min_values_index = output_df.idxmin(axis=1)
min_values_real = real_output_df.apply(lambda x: x.nsmallest(25).mean(), axis=1)
min_values_img = img_output_df.apply(lambda x: x.nsmallest(25).mean(), axis=1)
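The row-wise summary above can also be written directly in NumPy. The following is a minimal sketch, assuming the curves are stored as a 2-D array with one row per sample. The function name and the random stand-in data are hypothetical.

```python
import numpy as np

def mean_of_k_smallest(values, k=25):
    """Average the k smallest entries of each row of a 2-D array."""
    values = np.asarray(values, dtype=float)
    # np.partition moves the k smallest entries (in arbitrary order) into the first k slots.
    smallest = np.partition(values, k - 1, axis=1)[:, :k]
    return smallest.mean(axis=1)

# Hypothetical magnitude curves: 3 samples x 201 frequency points.
rng = np.random.default_rng(0)
curves = rng.uniform(0.2, 1.0, size=(3, 201))
summary = mean_of_k_smallest(curves, k=25)

# Reference computed with a full sort, used only as a check.
reference = np.sort(curves, axis=1)[:, :25].mean(axis=1)
```

Note that for the real and imaginary parts, which can be negative, the smallest values are the most negative ones rather than the ones closest to zero.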
The plot below marks, for each sample, the average of its 25 smallest values at the frequency index where the minimum occurs.
plt.figure(figsize=(12, 6))
for i in range(0,10):
    plt.plot(output_df.iloc[i], label=f"output {i}")
    plt.scatter(min_values_index[i], min_values_output[i], color="red")
plt.legend(loc="best", fontsize=6)
plt.title("Magnitude of outputs 0 to 9 with their avg. minimum values")
plt.xticks(range(0, 200, 25))
plt.xlabel("Frequency")
plt.show()
Another option is to apply PCA to the output data. PCA is a dimensionality reduction technique that transforms data into a lower-dimensional space while preserving as much of the variance as possible. Applying PCA to the output data reduces the 201 output variables to a small set of principal components that capture the most significant variance in the data.
pca = PCA(n_components=1) # Only one component is selected for simplicity
principal_components = pca.fit_transform(output_df)
explained_variance_df = pd.DataFrame({
'Standard deviation': np.sqrt(pca.explained_variance_),
'Proportion of Variance': pca.explained_variance_ratio_,
}, index=[f'Comp.{i+1}' for i in range(len(pca.explained_variance_))])
explained_variance_df
| | Standard deviation | Proportion of Variance |
|---|---|---|
| Comp.1 | 2.400357 | 0.610351 |
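The two quantities reported above can be reproduced by hand from a singular value decomposition of the centered data. The following is a minimal sketch, assuming variances are computed with the sample (n - 1) denominator. The function name and the synthetic data are hypothetical stand-ins for output_df.

```python
import numpy as np

def first_component_summary(data):
    """Return (standard deviation, proportion of variance, scores) of the first principal component."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)            # PCA works on column-centered data
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (data.shape[0] - 1)       # sample variance along each component
    scores = centered @ vt[0]                      # projection onto the first component
    return np.sqrt(variances[0]), variances[0] / variances.sum(), scores

# Hypothetical stand-in for the output data: 50 samples x 20 frequency points,
# generated from one latent factor plus a small amount of noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(50, 1))
data = latent @ rng.normal(size=(1, 20)) + 0.1 * rng.normal(size=(50, 20))
std_1, ratio_1, scores_1 = first_component_summary(data)
```

Here scores_1 plays the role of the single-column result of fit_transform, up to a possible sign flip, since the direction of a principal component is only defined up to sign.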
The PCA results indicate that a single component captures 61% of the variability in the magnitude of the S11 parameter. This is useful because the regression models need a one-dimensional target, and a target that retains most of the variability gives them more to learn from. The same technique is applied to the real and imaginary parts of the data as well.
pca = PCA(n_components=1) # Only one component is selected for simplicity
principal_components_real = pca.fit_transform(real_output_df)
explained_variance_df = pd.DataFrame({
'Standard deviation': np.sqrt(pca.explained_variance_),
'Proportion of Variance': pca.explained_variance_ratio_,
}, index=[f'Comp.{i+1}' for i in range(len(pca.explained_variance_))])
explained_variance_df
| | Standard deviation | Proportion of Variance |
|---|---|---|
| Comp.1 | 8.817202 | 0.855197 |
PCA performs even better on the real part of the data: a single component captures about 86% of the variability.
pca = PCA(n_components=1) # Only one component is selected for simplicity
principal_components_img = pca.fit_transform(img_output_df)
explained_variance_df = pd.DataFrame({
'Standard deviation': np.sqrt(pca.explained_variance_),
'Proportion of Variance': pca.explained_variance_ratio_,
}, index=[f'Comp.{i+1}' for i in range(len(pca.explained_variance_))])
explained_variance_df
| | Standard deviation | Proportion of Variance |
|---|---|---|
| Comp.1 | 3.135929 | 0.419402 |
PCA performs worse on the imaginary part than on the other two data sets: a single component captures only about 42% of the variability. However, this approach can still be effective in certain applications, depending on the specific use case.
3c. Inspecting Input Data
input_data_loc = "data/hw1_input.csv"
input_df = pd.read_csv(input_data_loc)
input_df.head()
| | length of patch | width of patch | height of patch | height of substrate | height of solder resist layer | radius of the probe | c_pad | c_antipad | c_probe | dielectric constant of substrate | dielectric constant of solder resist layer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.202024 | 2.288742 | 0.012514 | 0.139247 | 0.041757 | 0.028566 | 0.000549 | 0.032403 | 0.348140 | 3.735926 | 4.278575 |
| 1 | 2.107848 | 2.895504 | 0.037171 | 0.149492 | 0.056775 | 0.028930 | 0.005536 | 0.053647 | 0.326369 | 4.929862 | 4.876068 |
| 2 | 3.252113 | 4.818411 | 0.025432 | 0.578834 | 0.029972 | 0.030922 | 0.020274 | 0.049845 | 0.446639 | 4.772670 | 4.745106 |
| 3 | 4.161509 | 2.294309 | 0.011058 | 0.117266 | 0.093223 | 0.017604 | 0.001135 | 0.098610 | 0.055665 | 4.102438 | 3.755671 |
| 4 | 4.820912 | 2.948325 | 0.019658 | 0.163503 | 0.094337 | 0.025757 | 0.021725 | 0.072813 | 0.272282 | 2.531031 | 3.047553 |
The inputs consist of geometric parameters and material properties (dielectric constants). As the table shows, these features have very different ranges, so the data must be scaled to prevent features with larger values from dominating the analysis. Scaling is performed after the raw data has been examined.
input_df.describe() # Descriptive statistics of the input data
| | length of patch | width of patch | height of patch | height of substrate | height of solder resist layer | radius of the probe | c_pad | c_antipad | c_probe | dielectric constant of substrate | dielectric constant of solder resist layer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 385.000000 | 385.000000 | 385.000000 | 385.000000 | 385.000000 | 385.000000 | 385.000000 | 385.000000 | 385.000000 | 385.000000 | 385.000000 |
| mean | 3.569210 | 3.536983 | 0.024273 | 0.347643 | 0.060065 | 0.032198 | 0.012797 | 0.060648 | 0.245586 | 3.704384 | 3.521911 |
| std | 0.966173 | 1.182100 | 0.008800 | 0.272738 | 0.023670 | 0.010352 | 0.007111 | 0.021503 | 0.111245 | 0.853877 | 0.871233 |
| min | 1.805658 | 1.801273 | 0.010008 | 0.100321 | 0.020039 | 0.015012 | 0.000003 | 0.025292 | 0.050810 | 2.023380 | 2.001679 |
| 25% | 2.755534 | 2.501163 | 0.016194 | 0.126901 | 0.038689 | 0.023389 | 0.006985 | 0.042011 | 0.148565 | 2.998152 | 2.783710 |
| 50% | 3.637716 | 3.215396 | 0.024198 | 0.155254 | 0.060764 | 0.030979 | 0.012454 | 0.060532 | 0.245049 | 3.866295 | 3.480916 |
| 75% | 4.369311 | 4.829731 | 0.031688 | 0.649324 | 0.080247 | 0.041819 | 0.019014 | 0.078227 | 0.340203 | 4.375551 | 4.278575 |
| max | 5.199919 | 5.198689 | 0.039843 | 0.799082 | 0.099728 | 0.049960 | 0.024996 | 0.099945 | 0.449599 | 4.999324 | 4.999950 |
Apart from the strong correlation between height of substrate and width of patch (0.92), and the moderate correlation of both with the dielectric constant of substrate (about 0.45), the input features are essentially uncorrelated.
input_df.corr() # Correlation matrix
| | length of patch | width of patch | height of patch | height of substrate | height of solder resist layer | radius of the probe | c_pad | c_antipad | c_probe | dielectric constant of substrate | dielectric constant of solder resist layer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| length of patch | 1.000000 | -0.114174 | -0.026032 | -0.064344 | 0.044502 | -0.069905 | -0.005560 | -0.009627 | 0.081735 | -0.037448 | -0.014496 |
| width of patch | -0.114174 | 1.000000 | 0.091726 | 0.923739 | -0.021056 | 0.035675 | -0.044198 | -0.013358 | 0.029999 | 0.442193 | 0.038746 |
| height of patch | -0.026032 | 0.091726 | 1.000000 | 0.082833 | 0.036045 | 0.030689 | 0.002422 | 0.037583 | 0.031592 | -0.044318 | -0.037769 |
| height of substrate | -0.064344 | 0.923739 | 0.082833 | 1.000000 | -0.011870 | 0.017410 | -0.031571 | -0.009432 | 0.044166 | 0.459847 | 0.002421 |
| height of solder resist layer | 0.044502 | -0.021056 | 0.036045 | -0.011870 | 1.000000 | -0.012756 | -0.018044 | -0.017071 | 0.024842 | -0.010967 | -0.038405 |
| radius of the probe | -0.069905 | 0.035675 | 0.030689 | 0.017410 | -0.012756 | 1.000000 | 0.002906 | -0.000523 | 0.048088 | -0.025728 | 0.007524 |
| c_pad | -0.005560 | -0.044198 | 0.002422 | -0.031571 | -0.018044 | 0.002906 | 1.000000 | 0.067678 | -0.015500 | -0.049318 | 0.015284 |
| c_antipad | -0.009627 | -0.013358 | 0.037583 | -0.009432 | -0.017071 | -0.000523 | 0.067678 | 1.000000 | -0.132321 | -0.016741 | 0.067821 |
| c_probe | 0.081735 | 0.029999 | 0.031592 | 0.044166 | 0.024842 | 0.048088 | -0.015500 | -0.132321 | 1.000000 | 0.015640 | -0.029101 |
| dielectric constant of substrate | -0.037448 | 0.442193 | -0.044318 | 0.459847 | -0.010967 | -0.025728 | -0.049318 | -0.016741 | 0.015640 | 1.000000 | 0.060402 |
| dielectric constant of solder resist layer | -0.014496 | 0.038746 | -0.037769 | 0.002421 | -0.038405 | 0.007524 | 0.015284 | 0.067821 | -0.029101 | 0.060402 | 1.000000 |
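A correlation matrix of this size can be scanned programmatically instead of by eye. The following is a minimal sketch that lists the feature pairs whose absolute Pearson correlation exceeds a threshold. The function name, the threshold of 0.4, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def strongly_correlated_pairs(data, names, threshold=0.4):
    """List (name_a, name_b, r) for feature pairs whose |Pearson r| exceeds the threshold."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):         # upper triangle only, skipping the diagonal
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical data: "b" is built from "a", while "c" is independent noise.
rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)
pairs = strongly_correlated_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
```

Applied to the matrix above with the same threshold, such a scan would flag width of patch with height of substrate (0.92), and each of those two with dielectric constant of substrate (0.44 and 0.46).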
Again, apart from the height of substrate and the width of patch, the pair plot shows no notable relation between features.
pair_plot = sns.pairplot(input_df, kind='scatter', diag_kind='kde', markers='o', plot_kws={'alpha':0.5}) # Pair plot of the input data
plt.show()
Now, let us look more closely at the relation between the height of substrate and the width of patch. As can be seen below, the points form two separate clusters.
plt.figure(figsize=(12, 6))
plt.scatter(input_df["height of substrate"], input_df["width of patch"])
plt.title("Width of Patch vs Height of Substrate")
plt.xlabel("Height of Substrate")
plt.ylabel("Width of Patch")
plt.show()
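To reduce the dimensionality of the input space, the two variables are combined into a single binary feature that equals 1 when the width of patch exceeds 4 and the height of substrate exceeds 0.4, and 0 otherwise.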
input_df["width of patch combined with height of substrate"] = np.where((input_df["width of patch"] > 4) & (input_df["height of substrate"] > 0.4), 1, 0)
input_df.drop(["width of patch", "height of substrate"], axis=1, inplace=True)
The plot below suggests that this combined feature helps capture the behavior of the minimum magnitude. Therefore, it is used in the following models.
color = {0: "red", 1: "blue"} # Color mapping for the scatter plot to show the combined feature
plt.figure(figsize=(10, 6))
plt.scatter(range(len(min_values_index)), min_values_output,
c=input_df["width of patch combined with height of substrate"].map(color))
plt.title("Minimum Magnitudes with respect to width of patch combined with height of substrate")
plt.ylabel("Minimum Magnitudes")
plt.xticks([])
plt.show()
As discussed previously, the data needs to be scaled. Standardization is chosen over min-max scaling because it is less sensitive to outliers, if any exist.
scaler = StandardScaler()
input_df_scaled = pd.DataFrame(scaler.fit_transform(input_df), columns=input_df.columns)
input_df_scaled.head()
| | length of patch | height of patch | height of solder resist layer | radius of the probe | c_pad | c_antipad | c_probe | dielectric constant of substrate | dielectric constant of solder resist layer | width of patch combined with height of substrate |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.380536 | -1.338036 | -0.774472 | -0.351267 | -1.724662 | -1.315281 | 0.923074 | 0.036988 | 0.869628 | -0.798935 |
| 1 | -1.514495 | 1.467538 | -0.139201 | -0.316141 | -1.022443 | -0.326035 | 0.727110 | 1.437061 | 1.556322 | -0.798935 |
| 2 | -0.328626 | 0.131827 | -1.272998 | -0.123460 | 1.052887 | -0.503034 | 1.809644 | 1.252729 | 1.405808 | 1.251666 |
| 3 | 0.613834 | -1.503664 | 1.402635 | -1.411656 | -1.642198 | 1.767754 | -1.709451 | 0.466780 | 0.268659 | -0.798935 |
| 4 | 1.297212 | -0.525081 | 1.449735 | -0.622998 | 1.257188 | 0.566491 | 0.240287 | -1.375936 | -0.545176 | -0.798935 |
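The standardization above amounts to subtracting each column's mean and dividing by its standard deviation. The following is a minimal sketch of that computation, assuming the population standard deviation (ddof=0), which is what StandardScaler uses. The function name and sample values are hypothetical.

```python
import numpy as np

def standardize(data):
    """Column-wise z-scores using the population standard deviation (ddof=0)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)             # NumPy's default ddof=0
    return (data - mean) / std, mean, std

# Hypothetical two-feature input with very different ranges.
raw = np.array([[3.2, 0.012], [2.1, 0.037], [4.8, 0.020], [4.2, 0.011]])
scaled, mean, std = standardize(raw)
```

Because the binary feature is standardized together with the others, it no longer takes the values 0 and 1. It takes the two values -0.798935 and 1.251666 seen in the last column of the table above.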
The correlations between the features and the average of the 25 smallest magnitude values are calculated below. The combined feature has a strong negative correlation (-0.89) with this target, so it may be effective in linear models.
input_df_scaled.corrwith(min_values_output).sort_values(ascending=False)
length of patch                                     0.149842
c_pad                                               0.069841
c_antipad                                           0.030746
height of solder resist layer                       0.013917
radius of the probe                                -0.007607
c_probe                                            -0.019620
height of patch                                    -0.063299
dielectric constant of solder resist layer         -0.075817
dielectric constant of substrate                   -0.481474
width of patch combined with height of substrate   -0.886938
dtype: float64
Now, let’s apply the PCA method to further simplify the input data set.
pca_df = input_df_scaled.copy()
pca_df.drop("width of patch combined with height of substrate", axis=1, inplace=True) # Dropping the categorical feature
The binary feature is dropped because PCA is designed for continuous numeric data, and its variance-based components are most meaningful when the variables are roughly normally distributed.
pca = PCA() # Do not limit the number of components to see the explained variance of all components
pca.fit(pca_df) # fit() returns the estimator itself; the scores are computed later with transform()
explained_variance_df = pd.DataFrame({
'Standard deviation': np.sqrt(pca.explained_variance_),
'Proportion of Variance': pca.explained_variance_ratio_,
'Cumulative Proportion': np.cumsum(pca.explained_variance_ratio_)
}, index=[f'Comp.{i+1}' for i in range(len(pca.explained_variance_))])
explained_variance_df
| | Standard deviation | Proportion of Variance | Cumulative Proportion |
|---|---|---|---|
| Comp.1 | 1.104823 | 0.135274 | 0.135274 |
| Comp.2 | 1.059588 | 0.124423 | 0.259697 |
| Comp.3 | 1.042818 | 0.120516 | 0.380213 |
| Comp.4 | 1.009667 | 0.112976 | 0.493189 |
| Comp.5 | 0.999758 | 0.110769 | 0.603958 |
| Comp.6 | 0.972026 | 0.104709 | 0.708667 |
| Comp.7 | 0.969715 | 0.104212 | 0.812878 |
| Comp.8 | 0.940186 | 0.097962 | 0.910840 |
| Comp.9 | 0.896956 | 0.089160 | 1.000000 |
The table above shows no clear winner that captures most of the variability on its own, since all components explain a similar proportion of the variance.
See the table below to understand and compare the content of the components.
loadings_df = pd.DataFrame(pca.components_.T, columns=[f'Comp.{i+1}' for i in range(len(pca.explained_variance_))], index=pca_df.columns)
loadings_df
| | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | Comp.8 | Comp.9 |
|---|---|---|---|---|---|---|---|---|---|
| length of patch | 0.319646 | -0.025815 | -0.595711 | 0.292983 | 0.233550 | -0.211242 | -0.288572 | -0.323017 | -0.411906 |
| height of patch | 0.059382 | -0.514137 | 0.230340 | -0.291507 | 0.372631 | -0.540260 | 0.266577 | 0.139445 | -0.269928 |
| height of solder resist layer | 0.269357 | -0.215348 | -0.217159 | -0.430974 | 0.431274 | 0.667127 | -0.018495 | 0.118387 | 0.052558 |
| radius of the probe | 0.004249 | -0.195689 | 0.662581 | 0.255579 | 0.157651 | 0.285254 | -0.400929 | -0.346496 | -0.265191 |
| c_pad | -0.259423 | -0.330291 | -0.146192 | 0.547369 | 0.014845 | 0.323602 | 0.607875 | 0.005726 | -0.167748 |
| c_antipad | -0.556287 | -0.245124 | -0.192633 | -0.074022 | 0.291205 | -0.115210 | -0.152863 | -0.461867 | 0.503101 |
| c_probe | 0.562431 | 0.034002 | 0.171857 | 0.401044 | 0.294562 | -0.131520 | 0.167223 | -0.014750 | 0.599916 |
| dielectric constant of substrate | -0.038179 | 0.604910 | 0.133207 | -0.206798 | 0.326084 | 0.019769 | 0.462814 | -0.461898 | -0.194545 |
| dielectric constant of solder resist layer | -0.356633 | 0.337754 | 0.011891 | 0.263499 | 0.563588 | -0.037246 | -0.222571 | 0.561443 | -0.073395 |
The plot below shows the variance explained by each component. A threshold of 10% of the variance may be a good rule for deciding how many components to keep; it retains the first seven components.
explained_variance = explained_variance_df["Proportion of Variance"]
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--', color='b')
plt.axhline(y=0.1, color='r', linestyle='--') # 10% explained variance threshold
plt.title("Explained Variance of Principal Components")
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.xticks(range(1, len(explained_variance) + 1))
plt.grid()
plt.show()
The plot below shows the cumulative variance explained by the components. It suggests that about 81% of the variance is explained by the first 7 components.
cumulative_variance = explained_variance_df["Cumulative Proportion"]
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
plt.axhline(y=0.8, color='r', linestyle='--') # 80% cumulative explained variance threshold
plt.title('Cumulative Explained Variance of Principal Components')
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.xticks(range(1, len(cumulative_variance) + 1))
plt.grid()
plt.show()
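The choice read off the cumulative plot can also be computed directly. The following is a minimal sketch that returns the smallest number of leading components whose cumulative proportion of variance reaches a threshold. The function name is hypothetical, and the proportions are copied from the explained-variance table above.

```python
import numpy as np

def components_for_threshold(variance_ratios, threshold=0.8):
    """Smallest number of leading components whose cumulative variance ratio reaches the threshold."""
    cumulative = np.cumsum(variance_ratios)
    # searchsorted returns the first index at which cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Proportions of variance copied from the explained-variance table.
ratios = [0.135274, 0.124423, 0.120516, 0.112976, 0.110769,
          0.104709, 0.104212, 0.097962, 0.089160]
n_components = components_for_threshold(ratios, threshold=0.8)   # 7
n_components_90 = components_for_threshold(ratios, threshold=0.9)
```

scikit-learn offers a comparable shortcut: passing a float between 0 and 1 as n_components to PCA selects the number of components from the requested fraction of explained variance.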
Finally, the number of components is set to 7. The data is then transformed into the new space, and the binary feature is added back.
transformed_input_df = pd.DataFrame(pca.transform(pca_df), columns=[f'PC{i+1}' for i in range(len(pca.explained_variance_))])
transformed_input_df.drop(columns=["PC8", "PC9"], inplace=True) # Dropping the last two components
# Adding the categorical feature
transformed_input_df["width of patch combined with height of substrate"] = input_df_scaled["width of patch combined with height of substrate"]
transformed_input_df.head()
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | width of patch combined with height of substrate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.975511 | 2.172806 | 0.533331 | 0.267580 | -0.611399 | -0.373230 | -0.961117 | -0.798935 |
| 1 | -0.190132 | 1.213715 | 1.608164 | -1.023334 | 1.533051 | -1.074461 | 0.826188 | -0.798935 |
| 2 | 0.034671 | 1.308687 | 0.858340 | 1.833032 | 1.006886 | -0.753197 | 1.489438 | 1.251666 |
| 3 | -1.153707 | 1.155495 | -2.280813 | -2.088161 | 0.255904 | 0.704724 | -1.436043 | -0.798935 |
| 4 | 0.512146 | -1.516188 | -2.062709 | 0.632568 | 0.132800 | 1.102164 | -0.089013 | -0.798935 |
4. Comparison in Linear Models
The real part of the output is partitioned into training and test sets to assess the linear models. The training set (80% of the data) is used to fit the models, and the test set (the remaining 20%) is reserved for evaluation on unseen data. A fixed random state is used so that both splits select the same rows, which keeps the models comparable.
# Splitting the data into train and test sets
X_train_transformed, X_test_transformed, y_train_min_real, y_test_min_real = train_test_split(transformed_input_df, min_values_real, test_size=0.2, random_state=5)
X_train, X_test, y_train_pca, y_test_pca = train_test_split(input_df_scaled, principal_components_real, test_size=0.2, random_state=5)
X_train_transformed = sm.add_constant(X_train_transformed)
X_train = sm.add_constant(X_train)
# Models
model_1 = sm.OLS(y_train_min_real, X_train).fit()
model_2 = sm.OLS(y_train_min_real, X_train_transformed).fit()
model_3 = sm.OLS(y_train_pca, X_train).fit()
model_4 = sm.OLS(y_train_pca, X_train_transformed).fit()
comparison_df = pd.DataFrame({
'Model': ['Min Real Values + Original Input', 'Min Real Values + PCA Components', 'PCA Real Values + Original Input', 'PCA Real Values + PCA Components'],
'Adjusted R-squared': [model_1.rsquared_adj, model_2.rsquared_adj, model_3.rsquared_adj, model_4.rsquared_adj],
'AIC': [model_1.aic, model_2.aic, model_3.aic, model_4.aic],
'BIC': [model_1.bic, model_2.bic, model_3.bic, model_4.bic],
})
comparison_df_transposed = comparison_df.set_index('Model').T
comparison_df_transposed
| Model | Min Real Values + Original Input | Min Real Values + PCA Components | PCA Real Values + Original Input | PCA Real Values + PCA Components |
|---|---|---|---|---|
| Adjusted R-squared | 0.921997 | 0.921547 | 0.945604 | 0.943062 |
| AIC | -353.133657 | -353.295040 | 1328.727284 | 1340.862193 |
| BIC | -312.102560 | -319.724142 | 1369.758381 | 1374.433091 |
As the table shows, replacing the original inputs with PCA components has no significant effect on the fit. However, the fact that the PCA-transformed inputs perform similarly to the original inputs implies that comparable results can be achieved with fewer inputs (8 predictors instead of 10).
Also, the PCA-derived output is explained slightly better than the minimum-value summary, as shown by its slightly higher adjusted R-squared (0.946 versus 0.922). AIC and BIC depend on the scale of the response variable, so they are only meaningful for comparing models fitted to the same target and cannot be used to rank the two output summaries against each other. The better fit is plausible because the first principal component summarizes the whole curve, whereas the minimum-value summary uses only 25 points of it.
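To make the comparison metrics concrete, the following is a minimal sketch that computes adjusted R-squared and AIC for an ordinary least-squares fit. It assumes the usual Gaussian-likelihood definition of AIC; whether this matches the reported values digit for digit is an assumption. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def ols_metrics(X, y):
    """Fit OLS with an intercept; return adjusted R^2 and Gaussian-likelihood AIC."""
    X = np.column_stack([np.ones(len(y)), X])      # prepend the constant term
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape                                  # k counts the intercept
    ssr = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2_adj = 1.0 - (ssr / (n - k)) / (sst / (n - 1))
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1.0)
    aic = -2.0 * log_lik + 2.0 * k
    return r2_adj, aic

# Hypothetical regression problem with four predictors.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal(scale=0.5, size=300)
r2_small, aic_small = ols_metrics(X, y)
r2_large, aic_large = ols_metrics(X, 10.0 * y)     # same fit quality, target rescaled
```

Rescaling the target leaves adjusted R-squared unchanged but shifts AIC by n·log(100). This illustrates why information criteria are only comparable between models that share the same target.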
Now, let's inspect the best model in more detail. The plot below shows the predicted values and the actual values.
plt.figure(figsize=(12, 6))
plt.scatter(y_train_pca, model_3.predict(X_train), c="red")
plt.plot(y_train_pca, y_train_pca, color="blue")
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Predicted vs Actual values")
plt.show()
The plot suggests that the model predicts some values well, but in general it lacks the information needed to capture the full dynamics of the data.
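The models above are assessed on the training data. The following is a minimal sketch of how a held-out test set could be scored. It uses NumPy's least-squares solver as a stand-in for sm.OLS, and all names and data in it are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ 1 + X."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(beta, X):
    """Predictions from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical data standing in for the scaled inputs and a one-dimensional target.
rng = np.random.default_rng(4)
X = rng.normal(size=(385, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.3, size=385)

# 80/20 split on shuffled row indices.
order = rng.permutation(len(X))
cut = int(0.8 * len(X))
train, test = order[:cut], order[cut:]

beta = fit_ols(X[train], y[train])
train_rmse = rmse(y[train], predict(beta, X[train]))
test_rmse = rmse(y[test], predict(beta, X[test]))
```

With statsmodels, the test inputs would need the same constant column as the training inputs (for example via sm.add_constant) before being passed to the fitted model's predict method. A test error close to the training error would indicate that the model generalizes to unseen designs.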
Now, let’s apply the same models to the imaginary part of the data.
# Splitting the data into train and test sets
X_train_transformed, X_test_transformed, y_train_min_img, y_test_min_img = train_test_split(transformed_input_df, min_values_img, test_size=0.2, random_state=16)
X_train, X_test, y_train_pca, y_test_pca = train_test_split(input_df_scaled, principal_components_img, test_size=0.2, random_state=16)
X_train_transformed = sm.add_constant(X_train_transformed)
X_train = sm.add_constant(X_train)
# Models
model_1 = sm.OLS(y_train_min_img, X_train).fit()
model_2 = sm.OLS(y_train_min_img, X_train_transformed).fit()
model_3 = sm.OLS(y_train_pca, X_train).fit()
model_4 = sm.OLS(y_train_pca, X_train_transformed).fit()
comparison_df = pd.DataFrame({
'Model': ['Min Img. Values + Original Input', 'Min Img. Values + PCA Components', 'PCA Img. Values + Original Input', 'PCA Img. Values + PCA Components'],
'Adjusted R-squared': [model_1.rsquared_adj, model_2.rsquared_adj, model_3.rsquared_adj, model_4.rsquared_adj],
'AIC': [model_1.aic, model_2.aic, model_3.aic, model_4.aic],
'BIC': [model_1.bic, model_2.bic, model_3.bic, model_4.bic],
})
comparison_df_transposed = comparison_df.set_index('Model').T
comparison_df_transposed
| Model | Min Img. Values + Original Input | Min Img. Values + PCA Components | PCA Img. Values + Original Input | PCA Img. Values + PCA Components |
|---|---|---|---|---|
| Adjusted R-squared | 0.321934 | 0.284523 | 0.580122 | 0.563698 |
| AIC | 109.881455 | 124.489754 | 1306.270180 | 1316.155739 |
| BIC | 150.912553 | 158.060652 | 1347.301277 | 1349.726637 |
Here, the advantage of the PCA-derived output is clearer: its adjusted R-squared is considerably higher than that of the minimum-value approach (0.58 versus 0.32 with the original inputs). AIC and BIC depend on the scale of the target, so they should only be compared between models that share the same target. This result is consistent with the real-part models.
Now, let's inspect the best model in more detail. The plot below shows the predicted values against the actual values. The linear model captures the overall trend to some extent, but it lacks the information needed to capture the full dynamics of the data.
plt.figure(figsize=(12, 6))
plt.scatter(y_train_pca, model_3.predict(X_train), c="red")
plt.plot(y_train_pca, y_train_pca, color="blue")
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Predicted vs Actual values")
plt.show()
5. Conclusions
In conclusion, this assignment investigated the connection between antenna geometry parameters and S11 parameter values. Applying Principal Component Analysis (PCA) to the output data reduced the 201 output variables to a single principal component that explained about 61% of the variance in the magnitude, 86% in the real part, and 42% in the imaginary part. This simplified the data and reduced the modeling task to a single target per model. The linear models fitted the PCA-derived outputs better than the minimum-value summaries, with adjusted R-squared values of about 0.95 versus 0.92 for the real part and 0.58 versus 0.32 for the imaginary part. This suggests that the first principal component captures the dynamics of the data more effectively than the average of the smallest values. Nevertheless, the linear models were unable to fully capture the dynamics of the data, which implies that more sophisticated machine learning techniques or more detailed feature engineering on the input data may be necessary to improve predictive accuracy.
6. Code
Click here to access the code.